Abstract. This short communication uses a simple experiment to show that
fitting to a power-law distribution by using graphical methods based on a
linear fit on the log-log scale is biased and inaccurate. It shows that using
maximum likelihood estimation (MLE) is far more robust. Finally, it presents a
new table for performing the Kolmogorov-Smirnov test for goodness-of-fit
tailored to power-law distributions in which the power-law exponent is
estimated using MLE. The techniques presented here will advance the
application of complex network theory by allowing reliable estimation of
power-law models from data and further allowing quantitative assessment of
goodness-of-fit of proposed power-law models to empirical data.
1 Introduction

In recent years, a significant amount of research has focused on showing that
many physical and social phenomena follow a power-law distribution. Some
examples of these phenomena are the World Wide Web [?], metabolic networks
[6], Internet router connections [?], journal paper reference networks [17],
and sexual contact networks [12]. Often, simple graphical methods are used for
fitting the empirical data to a power-law distribution. Such graphical
analysis, based on linear fitting of log-log transformed data, can be grossly
erroneous.

The pure power-law distribution, known as the zeta distribution or discrete
Pareto distribution [?], is expressed as

$$p(k) = \frac{k^{-\gamma}}{\zeta(\gamma)}, \tag{1}$$

where:

- $k$ is a positive integer usually measuring some variable of interest,
  e.g., number of links per network node;
- $p(k)$ is the probability of observing the value $k$;
- $\gamma$ is the power-law exponent;
- $\zeta(\gamma)$ is the Riemann zeta function, defined as

$$\zeta(\gamma) = \sum_{k=1}^{\infty} k^{-\gamma}.$$

It is important to note, from this definition, that $\gamma > 1$ is required
for the Riemann zeta function to be finite.
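
As an illustration of Equation (1), the following minimal Python sketch evaluates the zeta probability mass function using only the standard library. The function names `riemann_zeta` and `zeta_pmf`, the number of summed terms, and the use of an Euler-Maclaurin tail correction are choices made for this example and are not taken from the paper.

```python
import math

def riemann_zeta(gamma, terms=1000):
    """Approximate zeta(gamma) for gamma > 1.

    Sums the first terms - 1 values of k^(-gamma) directly and estimates
    the remaining tail with the leading Euler-Maclaurin terms.
    """
    if gamma <= 1.0:
        raise ValueError("the zeta function is finite only for gamma > 1")
    m = terms
    head = sum(k ** -gamma for k in range(1, m))
    tail = (m ** (1.0 - gamma) / (gamma - 1.0)   # integral from m to infinity
            + 0.5 * m ** -gamma                   # half of the first tail term
            + gamma * m ** (-gamma - 1.0) / 12.0) # first derivative correction
    return head + tail

def zeta_pmf(k, gamma):
    """p(k) = k^(-gamma) / zeta(gamma) for a positive integer k."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return k ** -gamma / riemann_zeta(gamma)

# Probability of observing k = 1, 2, 3 when gamma = 2.5.
first_three = [zeta_pmf(k, 2.5) for k in (1, 2, 3)]
```

Because the normalizing constant is recomputed on every call, a real analysis would probably cache `riemann_zeta(gamma)`; the sketch favors readability.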

Without a quantitative measure of goodness-of-fit, it is difficult to assess
how well data approximates a power-law distribution. Moreover, a quantitative
analysis of the goodness-of-fit enables the identification of possible
interesting phenomena that could be causing the distribution to deviate from
a power-law. In some cases the underlying process may not actually generate
power-law distributed data, and an apparent power-law may instead be due to
outside influences, such as biased data collection techniques or random
bipartite structures [?]. Quantitative assessment of the goodness-of-fit for
the power-law distribution can assist in identifying these cases.

This paper demonstrates that the methods currently in broad use for fitting
to the power-law distribution tend to provide biased estimates for the
power-law exponent, while the maximum likelihood estimator (MLE) produces
more accurate and robust estimates. Finally, MLE permits the use of a
Kolmogorov-Smirnov (KS) test to assess goodness-of-fit. This paper provides a
new KS table suitable for testing power-law distributions whose exponent is
estimated by MLE.



2 Problems of currently used estimation 
methods 

In the literature, many researchers make parameter estimations using simple
graphical methods, such as 1) direct linear fit of the log-log plot of the
full raw histogram of the data [?], 2) fit of the first 5 points of the
log-log plot of the raw histogram [5], or 3) linear fitting to
logarithmically binned histograms [?]. The easy graphical nature of these
methods tends to mask their basic inaccuracy. In a simple experiment, a
random deviate generator was used to produce a dataset of 10,000 samples from
a known zeta distribution with exponent $\gamma = 2.500$. The three graphical
methods listed above were used to estimate the power-law exponent from the
dataset. This experiment was repeated 50 times and the tabulated results are
presented in Table 1. Linear fitting was performed using least squares
regression, where the magnitude of the slope of the fit was used as the
estimate of the exponent $\gamma$. MLE estimates of the exponent are also
included in the table.

Table 1. Sample results of parameter estimation using various methods for
10,000 samples of a power-law distribution with $\gamma = 2.500$. Sample
result based on 50 runs.

Estimation method   Mean estimated $\gamma$   $\sigma$   Bias error
Linear              1.590                     0.184      36%
Linear 5-points     2.500                     0.045
Log-2 bins          1.777                     0.038      29%
MLE                 2.500                     0.017

This table shows that two of the methods, full linear 
fit, and linear fit on logarithmic bins, suffer from severe 
bias, with 36% and 29% bias error respectively. The most 
accurate of the three graphical methods is the linear fit 
of the first 5 points, where the estimate is based on the 
slope of the first 5 points of the distribution histogram 
in log-log scale. These first 5 points contain most of the 
data and, due to the large number of samples, they can 
decrease the bias caused by the large uneven variation in 
the tail (the log-log transformation distorts the error in 
the tail unevenly). However, the variance of this estimate 
is much higher than the variance of estimates from MLE, 
showing the stability of MLE. 
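
The bias of the full linear fit can be reproduced with a short simulation. The sketch below is only an approximation of the experiment described above: the paper does not say which random deviate generator was used, so Devroye's rejection method is assumed here, and the helper names `sample_zeta` and `loglog_slope_estimate` are hypothetical. It performs a single run of method 1 (least squares on the raw log-log histogram), not the 50 repetitions reported in Table 1.

```python
import math
import random
from collections import Counter

def sample_zeta(gamma, rng):
    """Draw one zeta-distributed integer using Devroye's rejection method
    (an assumed generator; the paper does not name the one it used)."""
    b = 2.0 ** (gamma - 1.0)
    while True:
        u = 1.0 - rng.random()          # uniform on (0, 1]
        v = rng.random()
        x = math.floor(u ** (-1.0 / (gamma - 1.0)))
        t = (1.0 + 1.0 / x) ** (gamma - 1.0)
        if v * x * (t - 1.0) / (b - 1.0) <= t / b:
            return x

def loglog_slope_estimate(samples):
    """Least-squares line through (log k, log frequency) of the raw
    histogram; the negated slope is returned as the exponent estimate."""
    counts = Counter(samples)
    n = len(samples)
    xs = [math.log(k) for k in counts]
    ys = [math.log(c / n) for c in counts.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx

rng = random.Random(0)                  # fixed seed for repeatability
data = [sample_zeta(2.5, rng) for _ in range(10_000)]
estimate = loglog_slope_estimate(data)  # compare with the true exponent 2.5
```

The sparse tail is the main source of the distortion: values of k observed only once all sit at the same height in the log-log histogram, which flattens the fitted line.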

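Maximum likelihood estimation of the zeta distribution exponent $\gamma$
maximizes the log-likelihood function, which, assuming independence between
the data points, is given by

$$\begin{aligned}
\mathcal{L}(\gamma \mid x) &= \log \ell(\gamma \mid x)
  = \log \prod_{i=1}^{N} \frac{x_i^{-\gamma}}{\zeta(\gamma)} \\
&= \sum_{i=1}^{N} \left( -\gamma \log(x_i) - \log \zeta(\gamma) \right) \\
&= -\gamma \sum_{i=1}^{N} \log(x_i) - N \log \zeta(\gamma),
\end{aligned} \tag{2}$$

where:

- $\ell(\gamma \mid x)$ is the likelihood function of $\gamma$ given the
  unbinned data $x = \{x_i\}$, $1 \le i \le N$;
- $\mathcal{L}(\gamma \mid x)$ is the log-likelihood function.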

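This maximum can be obtained theoretically for the zeta distribution by
finding the zero of the derivative of the log-likelihood function:

$$\frac{d}{d\gamma} \mathcal{L}(\gamma \mid x)
  = -\sum_{i=1}^{N} \log(x_i) - N \frac{\zeta'(\gamma)}{\zeta(\gamma)} = 0
  \quad\Longrightarrow\quad
  \frac{\zeta'(\gamma)}{\zeta(\gamma)} = -\frac{1}{N} \sum_{i=1}^{N} \log(x_i),
  \tag{3}$$

where $\zeta'(\gamma)$ is the derivative of the Riemann zeta function.

A table with the value of the ratio $\zeta'(\gamma)/\zeta(\gamma)$ can be
found in [?]; alternatively, values can be generated with most modern
mathematical and engineering calculation programs, such as Matlab, Maple and
Mathematica.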
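
For readers without access to those programs, Equation (3) can also be solved numerically with a few lines of standard-library Python. This is a sketch under stated assumptions rather than the authors' procedure: the zeta function is approximated by a partial sum with an Euler-Maclaurin tail, its logarithmic derivative by a central difference, and the root is found by bisection. The names `log_zeta`, `zeta_log_derivative` and `mle_exponent`, and the search interval, are illustrative.

```python
import math

def log_zeta(gamma, terms=1000):
    """log of zeta(gamma): partial sum plus Euler-Maclaurin tail estimate."""
    m = terms
    head = sum(k ** -gamma for k in range(1, m))
    tail = (m ** (1.0 - gamma) / (gamma - 1.0) + 0.5 * m ** -gamma
            + gamma * m ** (-gamma - 1.0) / 12.0)
    return math.log(head + tail)

def zeta_log_derivative(gamma, h=1e-5):
    """zeta'(gamma) / zeta(gamma), as a central difference of log zeta."""
    return (log_zeta(gamma + h) - log_zeta(gamma - h)) / (2.0 * h)

def mle_exponent(mean_log, lo=1.01, hi=20.0, iterations=60):
    """Solve zeta'(g)/zeta(g) = -mean_log for g by bisection.

    mean_log is (1/N) * sum(log(x_i)). The ratio increases with g, so the
    root lies above any g at which the ratio is still too negative.
    """
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if zeta_log_derivative(mid) + mean_log < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Mean log value 0.2739 is the figure quoted for the papers-per-author data.
gamma_hat = mle_exponent(0.2739)
```

For the mean log value 0.2739 quoted for the papers-per-author data in Section 3, the paper reports $\gamma = 2.544$; the sketch should land close to that figure, up to the rounding of the input.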

Note that the parameter estimate of a power-law exponent has very limited
meaning without some assessment of its goodness-of-fit. The KS test is a
robust and simple goodness-of-fit test that can be used to obtain this
information.

3 Using a KS-Type Goodness-of-Fit Test for
Power-Law Distribution Hypothesis

The two most commonly used goodness-of-fit tests are Pearson's $\chi^2$ test
and the Kolmogorov-Smirnov (KS) type test. Pearson's $\chi^2$ test is very
simple to perform but has severe problems related to the choice of the number
of classes to use [?]. Because of this, in most cases it is preferable to use
the KS test. The KS test is based on the following test statistic:

$$K = \sup_{x} \left| F^*(x) - S(x) \right|, \tag{4}$$

where:

- $F^*(x)$ is the hypothesized cumulative distribution function;
- $S(x)$ is the empirical distribution function based on the sampled data.
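
A minimal Python sketch of Equation (4) for a fitted zeta distribution follows. It rests on a few assumptions not spelled out in the paper: the hypothesized CDF is computed from a Hurwitz-type tail sum approximated with Euler-Maclaurin terms, and, because both $F^*(x)$ and $S(x)$ are step functions that jump only at integers, the supremum is evaluated at each observed value and at the integer just before it. The function names are hypothetical.

```python
from collections import Counter

def hurwitz_zeta(gamma, a, terms=30):
    """Sum of (n + a)^(-gamma) over n >= 0 (partial sum plus tail estimate)."""
    head = sum((n + a) ** -gamma for n in range(terms))
    m = a + terms
    tail = (m ** (1.0 - gamma) / (gamma - 1.0) + 0.5 * m ** -gamma
            + gamma * m ** (-gamma - 1.0) / 12.0)
    return head + tail

def power_law_cdf(k, gamma):
    """F*(k) = P(X <= k) under the zeta distribution; zero for k < 1."""
    if k < 1:
        return 0.0
    return 1.0 - hurwitz_zeta(gamma, k + 1) / hurwitz_zeta(gamma, 1)

def ks_statistic(samples, gamma):
    """K = sup |F*(x) - S(x)| for integer data and a zeta hypothesis."""
    n = len(samples)
    counts = Counter(samples)
    k_stat = 0.0
    seen = 0
    for v in sorted(counts):
        # Just before the jump at v, S(x) still equals seen / n.
        k_stat = max(k_stat, abs(power_law_cdf(v - 1, gamma) - seen / n))
        seen += counts[v]
        # At v itself, S(x) has absorbed every observation equal to v.
        k_stat = max(k_stat, abs(power_law_cdf(v, gamma) - seen / n))
    return k_stat

k_small = ks_statistic([1, 1, 1, 2], 2.5)
```

Checking the point just before each jump matters when the data skip some integers: between two observed values $S(x)$ is flat while $F^*(x)$ keeps rising, so the largest gap can occur at the right end of that interval.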

Kolmogorov [?] first supplied a table for this test statistic for the case
where the hypothesized distribution function is independent of the data,
i.e., when none of the parameters of the hypothesized distribution function
is estimated from the data. When there is a dependency, other tables must be
used. This limitation was not taken into consideration by Pao and Nicholls in
their applications [?] of the KS test to power-laws. Without correcting for
this factor, the KS test gives a rejection rate lower than what is expected
[3].

Lilliefors later introduced tables for using the KS test with other
distributions, such as normal and exponential [10, 11]. These tables were
obtained using a Monte Carlo method, which is based on generating a large
number of distributions with random parameters and calculating the test
statistic for each of the test cases, from which empirical values for the
quantiles can be extracted. The same procedure was used to obtain these
values for the power-law distribution. For each of the logarithmically spaced
sample sizes, 10,000 power-law distributions were simulated, with a random
exponent from 1.5 to 4.0. Statistics were collected from these simulations to
generate the KS quantiles, shown in Table 2. This table was created assuming
MLE as the estimation method. Separate KS tables would have to be constructed
for other estimation methods.

Table 2. KS test table for power-law distributions, assuming MLE estimation.

            Quantile
# samples   0.9      0.95     0.99     0.999
10          0.1765   0.2103   0.2835   0.3874
20          0.1257   0.1486   0.2003   0.2696
30          0.1048   0.1239   0.1627   0.2127
40          0.0920   0.1075   0.1439   0.1857
50          0.0826   0.0979   0.1281   0.1719
100         0.0580   0.0692   0.0922   0.1164
500         0.0258   0.0307   0.0412   0.0550
1000        0.0186   0.0216   0.0283   0.0358
2000        0.0129   0.0151   0.0197   0.0246
3000        0.0102   0.0118   0.0155   0.0202
4000        0.0087   0.0101   0.0131   0.0172
5000        0.0073   0.0086   0.0113   0.0147
10000       0.0059   0.0069   0.0089   0.0117
50000       0.0025   0.0034   0.0061   0.0077
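
The Monte Carlo procedure behind the KS table can be sketched in standard-library Python as follows. Several details are assumptions made for illustration and are not stated in the paper: the exponent is drawn uniformly from 1.5 to 4.0, samples come from Devroye's rejection method, the exponent is re-estimated by MLE for every simulated sample, and only 200 simulations are run instead of 10,000 to keep the example fast. All function names are hypothetical.

```python
import math
import random
from collections import Counter

def hurwitz_zeta(gamma, a, terms=30):
    """Sum of (n + a)^(-gamma) over n >= 0 (partial sum plus tail estimate)."""
    head = sum((n + a) ** -gamma for n in range(terms))
    m = a + terms
    tail = (m ** (1.0 - gamma) / (gamma - 1.0) + 0.5 * m ** -gamma
            + gamma * m ** (-gamma - 1.0) / 12.0)
    return head + tail

def sample_zeta(gamma, rng):
    """One zeta-distributed integer via Devroye's rejection method."""
    b = 2.0 ** (gamma - 1.0)
    while True:
        u = 1.0 - rng.random()  # uniform on (0, 1]
        v = rng.random()
        x = math.floor(u ** (-1.0 / (gamma - 1.0)))
        t = (1.0 + 1.0 / x) ** (gamma - 1.0)
        if v * x * (t - 1.0) / (b - 1.0) <= t / b:
            return x

def fit_mle(samples, lo=1.01, hi=20.0, h=1e-5):
    """Bisection on the derivative of the log-likelihood (Equation 3)."""
    mean_log = sum(math.log(x) for x in samples) / len(samples)
    for _ in range(40):
        mid = 0.5 * (lo + hi)
        d_log_zeta = (math.log(hurwitz_zeta(mid + h, 1))
                      - math.log(hurwitz_zeta(mid - h, 1))) / (2.0 * h)
        if -mean_log - d_log_zeta > 0.0:
            lo = mid  # likelihood still increasing: the root is larger
        else:
            hi = mid
    return 0.5 * (lo + hi)

def ks_statistic(samples, gamma):
    """Largest gap between the fitted CDF and the empirical CDF."""
    n = len(samples)
    counts = Counter(samples)
    norm = hurwitz_zeta(gamma, 1)
    cdf = lambda k: 0.0 if k < 1 else 1.0 - hurwitz_zeta(gamma, k + 1) / norm
    k_stat, seen = 0.0, 0
    for v in sorted(counts):
        k_stat = max(k_stat, abs(cdf(v - 1) - seen / n))  # just before the jump
        seen += counts[v]
        k_stat = max(k_stat, abs(cdf(v) - seen / n))      # at the jump
    return k_stat

def ks_quantiles(n, simulations, rng, probs=(0.9, 0.95, 0.99)):
    """Empirical quantiles of K over simulated power-law samples of size n."""
    stats = []
    for _ in range(simulations):
        gamma = rng.uniform(1.5, 4.0)   # assumed uniform draw of the exponent
        data = [sample_zeta(gamma, rng) for _ in range(n)]
        stats.append(ks_statistic(data, fit_mle(data)))
    stats.sort()
    return {p: stats[min(len(stats) - 1, int(p * len(stats)))] for p in probs}

quantiles = ks_quantiles(n=100, simulations=200, rng=random.Random(1))
```

With so few simulations the estimated quantiles are noisy, so they should be compared with the row for 100 samples only loosely; the extreme 0.999 quantile in particular needs far more runs and is left out of the sketch.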

Conover [2] presents detailed instructions on how to use the KS table for
obtaining a goodness-of-fit estimate. Next, a very simple practical example
shows how to use the techniques presented in this paper.

The data set used contains 900 papers in the complex networks field, and the
distribution tested was that of the number of papers per author, often
characterized as a power-law known as Lotka's Law [19]. These papers were
written and co-written by a total of 1,354 different authors. Figure 1 shows
the empirical distribution in a log-log plot. The MLE estimate can be
obtained simply by calculating $\frac{1}{N} \sum_{i=1}^{N} \log(x_i)$, using
the sum that appears in Equation (2), where $x_i$ is the number of papers
authored by author $i$. For this data set, this quantity equals 0.2739. By
using Matlab, it is possible to solve Equation (3) for $\gamma$, resulting in
$\gamma = 2.544$. It is also possible to use the table provided in [?], but
it would result in lower precision. Figure 1 also shows the plot of the
fitted power-law line.

Fig. 1. Example of log-log plot of the papers-per-author distribution for 900
papers in the complex networks field (horizontal axis: number of papers per
author). The circles represent the empirical distribution and the line
represents the MLE estimate of the power-law distribution,
$\gamma_{MLE} = 2.544$.

Now, in order to test whether this fit is reasonable, the KS test can be
used. This test requires the calculation of the maximum distance between the
hypothesized cumulative distribution $F^*(x)$ (a power-law distribution with
exponent 2.544) and the empirical distribution $S(x)$. For this case, the
test statistic obtained was $K$ = 0.0117. The number of samples (number of
authors) is $N$ = 1,354. The closest value to $N$ in Table 2 is $N$ = 1,000.
(It would be statistically "safer" to choose the next highest number of
samples to ensure that the rejection rate is not lower than the one stated in
the test, but in practice it is considered reasonable to approximate to the
closest value when the statistic is not close to the critical values.)
Looking at the quantile values of the row for 1,000 samples, the observed
$K$, 0.0117, is below 0.0186, the 0.9 quantile. This means that the observed
significance level (OSL) is greater than 10%, i.e., in more than 10% of the
cases where the distribution is an actual power-law, the $K$ statistic is
greater than 0.0117. Therefore, with an OSL greater than 10%, there is
insufficient evidence to reject the hypothesis that the distribution is a
power-law.
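
The table lookup in this example can be written as a short Python sketch. The critical values are transcribed from Table 2; the names `KS_TABLE`, `LEVELS` and `osl_lower_bound`, and the rule of picking the row whose sample size is numerically closest to N, are choices made for this illustration.

```python
# Rows of Table 2: sample size -> critical values of K at the
# 0.9, 0.95, 0.99 and 0.999 quantiles.
KS_TABLE = {
    10: (0.1765, 0.2103, 0.2835, 0.3874),
    20: (0.1257, 0.1486, 0.2003, 0.2696),
    30: (0.1048, 0.1239, 0.1627, 0.2127),
    40: (0.0920, 0.1075, 0.1439, 0.1857),
    50: (0.0826, 0.0979, 0.1281, 0.1719),
    100: (0.0580, 0.0692, 0.0922, 0.1164),
    500: (0.0258, 0.0307, 0.0412, 0.0550),
    1000: (0.0186, 0.0216, 0.0283, 0.0358),
    2000: (0.0129, 0.0151, 0.0197, 0.0246),
    3000: (0.0102, 0.0118, 0.0155, 0.0202),
    4000: (0.0087, 0.0101, 0.0131, 0.0172),
    5000: (0.0073, 0.0086, 0.0113, 0.0147),
    10000: (0.0059, 0.0069, 0.0089, 0.0117),
    50000: (0.0025, 0.0034, 0.0061, 0.0077),
}

# Significance levels matching the four quantile columns.
LEVELS = (0.10, 0.05, 0.01, 0.001)

def osl_lower_bound(n, k_stat):
    """Return (table row used, lower bound on the OSL).

    The row with the sample size closest to n is used. The bound is the
    level of the first column whose critical value exceeds k_stat, or 0.0
    when k_stat is at or above every tabulated critical value.
    """
    row = min(KS_TABLE, key=lambda size: abs(size - n))
    for level, critical in zip(LEVELS, KS_TABLE[row]):
        if k_stat < critical:
            return row, level
    return row, 0.0

row_used, osl = osl_lower_bound(1354, 0.0117)
```

For the worked example (N = 1,354, K = 0.0117) the lookup uses the row for 1,000 samples and returns a lower bound of 0.10, matching the conclusion that the OSL exceeds 10%. A more conservative variant would pick the next highest tabulated sample size instead of the closest one.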

This simple example shows how easy it is to calculate the MLE estimate and
the K statistic, and how to consult the KS table to obtain a sound basis for
confirming or rejecting the power-law distribution hypothesis.



4 Conclusions 

A simple experiment using a random deviate generator 
shows that linear-fit based methods for estimating the 
power-law exponent tend to produce erroneous results. 
MLE based estimates, which are simple to produce using 
tables or built-in math functions in computational soft- 
ware, provide a more robust estimation of the power-law 
exponent. 

In conjunction with the MLE method, the KS-type 
test table given here can be used to produce a quantita- 
tive assessment of goodness-of-fit, allowing researchers to 
meaningfully assess and compare the appropriateness of 
modeling empirical data as a power-law distribution. 